UNITED STATES PATENT APPLICATION FOR 






ADAPTING BAYESIAN NETWORK PARAMETERS 
ON-LINE IN A DYNAMIC ENVIRONMENT 



Inventors : 
Ira Cohen 
Alexandre Bronstein 
Marsha Prescott Duro 



CERTIFICATE OF MAILING BY "EXPRESS MAIL" 
UNDER 37 C.F.R. § 1.10 

"Express Mail" mailing label number: EK911714e79US 
Date of Mailing: 12-18-2001 

I hereby certify that this correspondence is 
being deposited with the United States Postal 
Service, utilizing the "Express Mail Post Office to 
Addressee" service addressed to Assistant 
Commissioner for Patents, Washington, D.C. 20231 and 
mailed on the above Date of Mailing with the above 
"Express Mail" mailing label number. 




Paul H. Horstmann, Reg. No. 36,167 
Signature Date: 



BACKGROUND OF THE INVENTION 



Field of Invention 

The present invention pertains to the field of 
automated reasoning. More particularly, this 
invention relates to Bayesian networks in automated 
reasoning. 

Art Background 

Bayesian networks are commonly used for 
automated reasoning in a wide variety of 
applications. Typically, Bayesian networks are used 
to model an underlying system or environment of 
interest. For example, Bayesian networks may be used 
to model biological systems including humans and 
animals, electrical systems, mechanical systems, 
software systems, business transaction systems, etc. 
Bayesian networks may be useful in a variety of 
automated reasoning tasks, including diagnosing 
problems with an underlying system, determining the 
health of an underlying system, and predicting future 
events in an underlying system, to name a few 
examples. 

A typical Bayesian network is a graph structure 
having a set of nodes and interconnecting arcs that 
define parent-child relationships among the nodes. A 
Bayesian network also includes a set of Bayesian 
network parameters which are associated with the 
nodes of the graph structure. Typically, the nodes 
of a Bayesian network are associated with events or 
characteristics of the underlying modeled environment 
and the Bayesian network parameters usually indicate 
partial causalities among the events or 
characteristics associated with the nodes. The 
Bayesian network parameters are commonly contained in 
conditional probability tables associated with the 
nodes. Typically, a Bayesian network describes the 
joint probability of random variables each of which 
is represented by a node in the graph. 

The Bayesian network parameters are commonly 
obtained from experts who possess knowledge of the 
behaviors or characteristics of the underlying 
modeled environment. Alternatively, the Bayesian 
network parameters may be obtained using observations 
of the underlying modeled environment. 
Unfortunately, environments may exist for which 
experts are unavailable or prohibitively expensive. 
In addition, environments may exist for which 
observation data is scarce or in which the underlying 
environment changes and renders past experience or 
observations obsolete. 



SUMMARY OF THE INVENTION 



A method is disclosed for adapting a Bayesian 
network. Using the present techniques, a Bayesian 
network may be adapted on-line to changes in the 
underlying modeled environment, even in a dynamic 
environment in which observation data is relatively 
scarce. The present method includes determining 
a set of parameters for the Bayesian network, for 
example, initial parameters, and then updating the 
parameters in response to a set of observation data 
using an adaptive learning rate. The adaptive 
learning rate responds to changes in the 
underlying modeled environment using minimal 
observation data. 

Other features and advantages of the present 
invention will be apparent from the detailed 
description that follows. 



BRIEF DESCRIPTION OF THE DRAWINGS 



The present invention is described with respect 
to particular exemplary embodiments thereof and 
reference is accordingly made to the drawings in 
which: 

Figure 1 shows an on-line adapter which adapts a 
Bayesian network according to the present teachings; 

Figure 2 shows a method for adapting Bayesian 
network parameters according to the present 
teachings; 

Figure 3 shows an example Bayesian network for 
illustrating the present techniques; 

Figure 4 shows a method for determining an 
adaptive learning rate according to the present 
teachings. 



DETAILED DESCRIPTION 



Figure 1 shows an on-line adapter 56 which 
adapts a Bayesian network 52 according to the present 
teachings. The Bayesian network 52 performs 
automated reasoning with respect to an on-line 
environment 50. The on-line adapter 56 obtains a set 
of observation data 54 from one or more elements of 
the on-line environment 50 and adapts the parameters 
of the Bayesian network 52 in response to the 
observation data 54. 

The present techniques enable adaptation of the 
parameters of the Bayesian network 52 in response to 
changes in the on-line environment 50 even when the 
observation data 54 is relatively scarce and/or when 
some of the values in the observation data 54 from 
elements of the on-line environment 50 are 
unavailable. For example, at any given time values 
from a subset of the hardware/software elements of 
the on-line environment 50 may be unavailable due to 
hardware/software failures and/or due to the nature 
of events in the on-line environment 50. 

The on-line environment 50 may be the 
hardware/software elements of an email system, an 
e-commerce system, a database system, or any type of 
distributed application, to name a few examples. 

Figure 2 shows a method for adapting Bayesian 
network parameters according to the present 
teachings. At step 102, a set of initial Bayesian 
network parameters is determined. At step 104, the 
Bayesian network parameters are updated in response 
to a set of observation data using an adaptive 
learning rate. The step 104 may be repeated for each 
of a set of observation data records. 

Figure 3 shows an example Bayesian network 100 
for illustrating the present techniques. The 
Bayesian network 100 includes a set of nodes 10-14. 
The node 10 corresponds to a variable (EXCHANGE_A) 
which indicates whether a stock exchange A is up (U) 
or down (D) in terms of change in value. The node 12 
corresponds to a variable (EXCHANGE_B) which 
indicates whether a stock exchange B is up or down. 
The node 14 corresponds to a variable (MY_STOCKS) 
which indicates whether the value of a set of stocks 
associated with the stock exchanges A and B and held 
by a particular individual is up or down. 

The nodes 10-14 have associated conditional 
probability tables 20-24, respectively, for holding 
the parameters of the Bayesian network 100. The 
conditional probability tables 20-24 are written with 
a set of initial Bayesian network parameters 
determined at step 102. 

An example set of initial Bayesian network 
parameters in the conditional probability table 20 is 
as follows: 

entry    D      U 
0        0.2    0.8 

Entry 0 in the conditional probability table 20 
indicates that the probability that the variable 
EXCHANGE_A = D is 20 percent and the probability that 
the variable EXCHANGE_A = U is 80 percent under all 
conditions. 

An example set of initial Bayesian network 
parameters in the conditional probability table 22 is 
as follows: 

entry    D      U 
0        0.3    0.7 

Entry 0 in the conditional probability 
table 22 indicates that the probability that the 
variable EXCHANGE_B = D is 30 percent and the 
probability that the variable EXCHANGE_B = U is 70 
percent under all conditions. 

An example set of initial Bayesian network 
parameters in the conditional probability table 24 
is as follows: 

entry    EXCHANGE_A    EXCHANGE_B    D      U 
0        D             D             0.5    0.5 
1        D             U             0.5    0.5 
2        U             D             0.5    0.5 
3        U             U             0.5    0.5 

Entry 0 in the conditional probability 
table 24 indicates that the probability that the 
variable MY_STOCKS = D is 50 percent and that the 
probability that the variable MY_STOCKS = U is 50 
percent given that EXCHANGE_A = D and EXCHANGE_B = D. 
Similarly, entries 1-3 in the conditional probability 
table 24 indicate that the probability that the 
variable MY_STOCKS = D is 50 percent and that the 
probability that the variable MY_STOCKS = U is 50 
percent for each of the remaining combinations of 
conditions of EXCHANGE_A and EXCHANGE_B. 
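
As a minimal sketch of how the example network defines a joint probability, the three tables above can be held in plain Python dictionaries and multiplied along the graph structure. The dictionary layout and the function name `joint_probability` are illustrative assumptions and are not part of the disclosed method.

```python
# Conditional probability tables 20, 22 and 24 of the example network.
# The dictionary layout is an assumption made for illustration only.
CPT_EXCHANGE_A = {"D": 0.2, "U": 0.8}
CPT_EXCHANGE_B = {"D": 0.3, "U": 0.7}
CPT_MY_STOCKS = {
    ("D", "D"): {"D": 0.5, "U": 0.5},  # entry 0
    ("D", "U"): {"D": 0.5, "U": 0.5},  # entry 1
    ("U", "D"): {"D": 0.5, "U": 0.5},  # entry 2
    ("U", "U"): {"D": 0.5, "U": 0.5},  # entry 3
}


def joint_probability(a, b, s):
    """P(EXCHANGE_A=a, EXCHANGE_B=b, MY_STOCKS=s) as a product over the nodes."""
    return CPT_EXCHANGE_A[a] * CPT_EXCHANGE_B[b] * CPT_MY_STOCKS[(a, b)][s]


if __name__ == "__main__":
    # Probability of the case EXCHANGE_A = U, EXCHANGE_B = D, MY_STOCKS = D.
    print(joint_probability("U", "D", "D"))
```

Because each table row sums to one, the eight joint probabilities produced by `joint_probability` also sum to one.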

At step 104, the Bayesian network parameters 
contained in the conditional probability table 24 are 
updated in response to the following example set of 
observation data: 

EXCHANGE_A = U 
EXCHANGE_B = D 
MY_STOCKS = D 

This example observation data corresponds to 
entry 2 in the conditional probability table 24 with 
the conditions EXCHANGE_A = U and EXCHANGE_B = D. 

Entry 2 is updated at step 104 in response to 
the example observation as follows. The new 
probability that MY_STOCKS = D given that EXCHANGE_A 
= U and EXCHANGE_B = D equals $\eta$ plus $(1-\eta)$ times the 
previous probability that MY_STOCKS = D given that 
EXCHANGE_A = U and EXCHANGE_B = D, where $\eta$ is an 
adaptive learning rate which is a number between 0 
and 1. This increases the probability that MY_STOCKS 
= D given that EXCHANGE_A = U and EXCHANGE_B = D. 

In addition, the probability that MY_STOCKS = U 
given that EXCHANGE_A = U and EXCHANGE_B = D is 
decreased at step 104 as follows. The new 
probability that MY_STOCKS = U given that EXCHANGE_A 
= U and EXCHANGE_B = D equals $(1-\eta)$ times the 
previous probability that MY_STOCKS = U given that 
EXCHANGE_A = U and EXCHANGE_B = D. 

The updated conditional probability table 24 
after step 104 is as follows: 



entry    EXCHANGE_A    EXCHANGE_B    D                              U 
0        D             D             0.5                            0.5 
1        D             U             0.5                            0.5 
2        U             D             $\eta + (1-\eta) \cdot 0.5$    $(1-\eta) \cdot 0.5$ 
3        U             U             0.5                            0.5 
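
The update of entry 2 can be sketched in a few lines of Python. The function name `update_entry`, the dictionary representation of a table row, and the value 0.2 chosen for the learning rate are assumptions made for illustration only.

```python
def update_entry(row, observed, eta):
    """Return a new table row after one fully observed case.

    row:      dict mapping each value of the node to its probability.
    observed: the value of the node seen in the observation.
    eta:      learning rate, a number between 0 and 1.
    """
    return {
        value: (1.0 - eta) * p + (eta if value == observed else 0.0)
        for value, p in row.items()
    }


# Entry 2 of table 24: EXCHANGE_A = U, EXCHANGE_B = D.
entry_2 = {"D": 0.5, "U": 0.5}
# The observation has MY_STOCKS = D; eta = 0.2 is an arbitrary example value.
updated = update_entry(entry_2, observed="D", eta=0.2)

if __name__ == "__main__":
    print(updated)
```

With these example numbers the probability of D rises from 0.5 to about 0.6 and the probability of U falls to about 0.4, so the row still sums to one.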



The following generalized description enables 
adaptation for systems including on-line environments 
in which some of the values in the observation data 
54 are unavailable. In the following, $Z_i$ is a node in 
a Bayesian network that takes any value from the set: 

$$\{z_i^1, \ldots, z_i^{r_i}\}$$

$Pa_i$ is the set of parents of $Z_i$ in the Bayesian 
network that take on one of the configurations denoted by 
the following: 

$$\{pa_i^1, \ldots, pa_i^{q_i}\}$$

For the Bayesian network 100 example, if $Z_i$ is 
the node 14 then $Pa_i$ consists of the nodes 10 and 12 
and the configurations $\{pa_i^1, \ldots, pa_i^4\}$ are DD, DU, UD, and 
UU. 

An entry in the conditional probability table 
for the node $Z_i$ is given by the following: 

$$\theta_{ijk} = P(Z_i = z_i^k \mid Pa_i = pa_i^j)$$

A set of observation data cases $D$ is 
represented as follows: 

$$D = \{y_1, \ldots, y_t, \ldots\}$$



The update of the parameters for the conditional 
probability table for the node $Z_i$ is achieved by the 
following maximization (equation 1): 

$$\tilde{\theta} = \arg\max_{\theta} \left[ \eta L_D(\theta) - d(\theta, \bar{\theta}) \right]$$

where $L_D(\theta)$ is the normalized log likelihood of the 
data given the network, $d(\theta, \bar{\theta})$ is a distance between 
the two models (the candidate parameters $\theta$ and the 
current parameters $\bar{\theta}$), and $\eta$ is the learning rate. In one 
embodiment, the distance is the Chi squared distance. 
The maximization is solved under the constraint that 
$\sum_k \theta_{ijk} = 1$ for all $i, j$. 

For each new observation vector, the parameters 
for all conditional probability tables in a Bayesian 
network may be updated according to the following 
(equation 2): 

$$\theta_{ijk}^{T} = \theta_{ijk}^{T-1} + \eta \left[ \frac{P(z_i^k, pa_i^j \mid y_T, \theta^{T-1})}{P(pa_i^j \mid y_T, \theta^{T-1})} - \theta_{ijk}^{T-1} \right]$$

for $P(pa_i^j \mid y_T, \theta^{T-1}) \neq 0$, and 

$$\theta_{ijk}^{T} = \theta_{ijk}^{T-1}$$

otherwise. 

This update process may be referred to as 
stochastic learning in which the term 

$$\frac{P(z_i^k, pa_i^j \mid y_T, \theta^{T-1})}{P(pa_i^j \mid y_T, \theta^{T-1})}$$

is an instantaneous gradient estimate of the 
constrained optimization problem. 
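
A minimal sketch of one such update for a single table row is given below. It assumes the reading of equation 2 in which each parameter moves by the learning rate times the difference between the ratio term above and the previous parameter value, and in which the row is left unchanged when the parent configuration has zero posterior probability. The function name `update_row` and its arguments are hypothetical, and the posterior probabilities are assumed to be supplied by a separate inference routine that is not shown.

```python
def update_row(theta_row, eta, p_joint, p_parents):
    """One stochastic update of a table row (fixed node i, parent configuration j).

    theta_row: current parameters theta_ijk, one per value index k.
    eta:       learning rate between 0 and 1.
    p_joint:   P(z_i^k, pa_i^j | y, theta) for each k, from an inference routine.
    p_parents: P(pa_i^j | y, theta), i.e. the sum of p_joint over k.
    """
    if p_parents == 0.0:
        # The evidence rules out this parent configuration: no change.
        return list(theta_row)
    return [
        theta + eta * (pj / p_parents - theta)
        for theta, pj in zip(theta_row, p_joint)
    ]


if __name__ == "__main__":
    # Fully observed case: parents in configuration j, node takes value 0.
    print(update_row([0.5, 0.5], 0.2, [1.0, 0.0], 1.0))
    # Node value missing: the ratio equals the current parameters, so the
    # row is returned unchanged.
    print(update_row([0.5, 0.5], 0.2, [0.2, 0.2], 0.4))
```

The second call illustrates why missing values are tolerated: when an observation carries no information about the node, the ratio term reduces to the current parameter and the update leaves the row as it was.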

The learning rate $\eta$ may be used to control the 
amount of reliance on past observation data when 
updating the conditional probability table parameters 
of a Bayesian network. As $\eta$ approaches 1 the past 
observation data is weighted less and the update of 
the parameters is based more on the present 
observation data. As $\eta$ approaches 0 the parameters 
are updated slowly from the previous parameters. 

The update rule of equation 2 may be rewritten 
as follows (equation 3) assuming a constant learning 
rate $\eta$: 

$$\theta_{ijk}^{t} = X_t = (1-\eta) X_{t-1} + \eta I_t$$

assuming with no loss of generality that 

$$P(pa_i^j \mid y_t, \theta^{t-1}) = 1$$

for all $t = \{1, \ldots, T, \ldots\}$, i.e., the parents are always 
observed in their $j$-th configuration. The assumption 
is a notational convention and not an actual 
restriction or constraint. 

$I_t$ is an indicator function that equals 1 when the 
node is observed in its $k$-th value and 0 otherwise, 
and the process $\{I_t\}$ is 
an independent identically distributed Bernoulli 
random process given as $I_t = 1$ with probability $\theta_{ijk} = c^*$ 
and $I_t = 0$ with probability $1 - c^*$ (equation 4) where 

$$c^* = P(X_i = x_i^k \mid Pa_i = pa_i^j)$$

is the true conditional probability table entry of 
the Bayesian network. 
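
Unrolling the recursion of equation 3 makes the role of the learning rate explicit. The short derivation below is a sketch that assumes only that recursion and a fixed initial value $X_0$:

```latex
\begin{align*}
X_t &= (1-\eta)\,X_{t-1} + \eta\,I_t \\
    &= (1-\eta)^2 X_{t-2} + \eta(1-\eta)\,I_{t-1} + \eta\,I_t \\
    &= (1-\eta)^t X_0 + \eta \sum_{s=1}^{t} (1-\eta)^{t-s} I_s .
\end{align*}
```

An observation made $s$ steps in the past therefore carries the weight $\eta(1-\eta)^s$, so the estimate is an exponentially weighted average of the indicator values and the influence of the initial value decays as $(1-\eta)^t$.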

Given a discrete Bayesian network S, a sequence 
of full observation vectors D, the update rule given 
in equation 3, and the constraint $0 < \eta \leq 1$, it may be 
shown that the following properties hold: 

Property 1: $X_t$ is a consistent estimate of $c^*$, 
i.e., 

$$E[X_t] = (1-\eta)^t X_0 + \left(1 - (1-\eta)^t\right) c^*, \quad t \geq 0 \;\Rightarrow\; \lim_{t \to \infty} E[X_t] = c^*$$

where $X_0$ is the initial value set at t=0. 

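Property 2: The variance of the estimate $X_t$ is 
finite and follows (equations 6 and 7): 

$$\mathrm{Var}[X_t] = \frac{\eta}{2-\eta}\, c^*(1-c^*) \left(1 - (1-\eta)^{2t}\right) \;\Rightarrow$$

$$\lim_{t \to \infty} \mathrm{Var}[X_t] = \frac{\eta}{2-\eta}\, c^*(1-c^*)$$

Property 3: For $t \to \infty$ the following inequality 
holds: 

$$P(|X_t - c^*| \geq q\sigma) \leq \frac{1}{q^2}$$

for any $q > 0$, where $\sigma$ is the standard deviation of $X_t$. 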
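
The mean and variance of the estimate can be checked numerically without any sampling. The sketch below iterates the recursion of equation 3 for the first two moments, assuming that the indicator values are independent Bernoulli variables with parameter c* and that the initial value is fixed, and compares the result with closed-form expressions for the mean and variance. The function names are hypothetical.

```python
def moments_by_recursion(eta, c_star, x0, steps):
    """Mean and variance of X_t obtained by iterating equation 3.

    Assumes I_t is Bernoulli(c_star), independent across t, and X_0 is fixed.
    """
    mean, var = x0, 0.0
    for _ in range(steps):
        mean = (1.0 - eta) * mean + eta * c_star
        var = (1.0 - eta) ** 2 * var + eta ** 2 * c_star * (1.0 - c_star)
    return mean, var


def moments_closed_form(eta, c_star, x0, t):
    """Closed-form mean and variance of X_t after t updates."""
    decay = (1.0 - eta) ** t
    mean = decay * x0 + (1.0 - decay) * c_star
    var = eta / (2.0 - eta) * c_star * (1.0 - c_star) * (1.0 - decay ** 2)
    return mean, var


if __name__ == "__main__":
    for t in (1, 10, 100):
        print(t, moments_by_recursion(0.1, 0.3, 0.9, t),
              moments_closed_form(0.1, 0.3, 0.9, t))
```

In this sketch the mean approaches c* while the variance levels off at a nonzero value that grows with the learning rate, which is the trade-off between speed of convergence and variance that a constant learning rate entails.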

It is apparent that in the mean the online 
update rule of equation 3 approaches true values for 
Bayesian network parameters. The learning rate $\eta$ 
controls the rate of convergence. A setting of $\eta = 1$ 
yields the fastest convergence with the largest 
variance. Smaller values of $\eta$ yield slower 
convergence with smaller variance. The variance is 
proportional to $\eta$ and remains finite in the limit and 
thus the estimate for a Bayesian network parameter 
oscillates around its true value. The learning rate 
$\eta$ may be viewed as a forgetting bias of the learning 
algorithm so that the system forgets the past at an 
exponential rate proportional to $\eta$. 

Property 3 provides the confidence intervals of 
an estimated Bayesian network parameter with respect 
to the variance of the estimate. Property 3 may be 
employed in adapting the learning rate $\eta$ to changes 
in the underlying modeled environment. 

Figure 4 shows a method for adapting the 
learning rate $\eta$ according to the present teachings. 
At step 112, the learning rate $\eta$ is initially set to 
a value between 0 and 1. The learning rate $\eta$ in one 
embodiment is set to a relatively high value at step 
112. 

At step 114, an estimate of the Bayesian network 
parameters for an underlying modeled environment is 
determined in response to a set of observation data. 
The estimate of the Bayesian network parameters may 
be obtained using equation 2 given above. 

At step 116, the learning rate $\eta$ is increased if 
an error between the estimate obtained at step 114 
and a mean value of the Bayesian network parameters 
for the underlying modeled environment is relatively 
large. A relatively large error may occur, for 
example, when the modeled underlying environment 
changes. In one embodiment, a large error is 
indicated by the inequality stated in property 3 
above. 

Otherwise at step 118, the learning rate $\eta$ is 
decreased when convergence is reached between the 
estimate determined at step 114 and the mean value of 
the Bayesian network parameters for the underlying 
modeled environment. 

In the following example embodiment, a different 
learning rate is assigned for each set of parameters 
of a Bayesian network, i.e. for each node $Z_i$ with 
parents $Pa_i$ and parent configuration $pa_i^j$ the 
learning rate is denoted as $\eta_{ij}$. Letting T denote the 
total number of observation data, t denote the number 
of times $Pa_i = pa_i^j$, and $\delta t$ denote the number of times 
$Pa_i = pa_i^j$ since the last time $\eta_{ij}$ was changed, the 
schedule for adapting the learning rate is as follows. 

Initialize the following: 
    Set $P[X_i = x_i^k \mid Pa_i = pa_i^j] = \theta_{ijk}^{0}$ to an initial value for 
    $k = 1, \ldots, r_i$, where $r_i$ is the number of values of $X_i$. 
    Set $\eta_{ij}$ to a value between 0 and 1. For example 
    a high value may be initially set. 
    Set $t, \delta t = 0$. 

Given an observation vector $y_T$, if $Pa_i = pa_i^j$ then 
do the following: 

Step 1: Estimate $\theta_{ijk}^{t+1}$ using equation 3 where $\eta$ 
is replaced by $\eta_{ij}$. 

Step 2: If $|\theta_{ijk}^{t+1} - E[\theta_{ijk}^{t+1}]| > q\sigma_{ij}^{t+1}$ then 
        increase $\eta_{ij}$: set $\eta_{ij} = \eta_{ij} \cdot m$    \\ $m > 1$ 
        set $\delta t = 0$ 
    Else if $(1-\eta_{ij})^{\delta t} < \alpha$    \\ $\alpha \ll 1$ 
        decrease $\eta_{ij}$: set $\eta_{ij} = \eta_{ij} \cdot m^{-1}$ 
        set $\delta t = 0$ 
    Else set $\delta t = \delta t + 1$ 

Step 3: Get the next observation and repeat 
steps 1-2. 
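
The schedule above can be sketched in Python for a single table row. Several details in this sketch are assumptions that the text leaves open: the learning rate is capped at 1 when it is increased, the error is taken as the largest deviation over the entries of the row, the running average includes the current estimate, and the variance of equation 6 is evaluated with the count plus one so that it is not zero immediately after a reset. The names `adapt_learning_rate` and `AdaptiveRow` and the default values of q, m and alpha are hypothetical.

```python
import math


def adapt_learning_rate(eta, dt, error, sigma, q=2.0, m=2.0, alpha=0.05):
    """Step 2 of the schedule: return the new (eta, dt, increased)."""
    if error > q * sigma:
        # Large error: raise eta (capped at 1, an assumption) and restart the count.
        return min(1.0, eta * m), 0, True
    if (1.0 - eta) ** dt < alpha:
        # Converged for the current eta: lower it and restart the count.
        return eta / m, 0, False
    return eta, dt + 1, False


class AdaptiveRow:
    """Parameters theta_ijk for one node i and one parent configuration j."""

    def __init__(self, theta, eta=0.5, q=2.0, m=2.0, alpha=0.05):
        self.theta = list(theta)
        self.eta, self.q, self.m, self.alpha = eta, q, m, alpha
        self.dt = 0
        self.mean = list(theta)   # running average of the estimates
        self.count = 1            # number of estimates in the running average

    def sigma(self):
        # Equation 6 with c* = 0.5 (worst case). Using dt + 1 instead of dt
        # is an assumption that avoids a zero variance right after a reset.
        decay = (1.0 - self.eta) ** (2 * (self.dt + 1))
        return math.sqrt(self.eta / (2.0 - self.eta) * 0.25 * (1.0 - decay))

    def observe(self, k):
        """Process one observation in which the node takes its k-th value."""
        # Step 1: equation 3 applied to every entry of the row.
        self.theta = [
            (1.0 - self.eta) * th + (self.eta if idx == k else 0.0)
            for idx, th in enumerate(self.theta)
        ]
        self.count += 1
        self.mean = [mu + (th - mu) / self.count
                     for mu, th in zip(self.mean, self.theta)]
        # Step 2: compare the largest deviation in the row with q * sigma.
        error = max(abs(th - mu) for th, mu in zip(self.theta, self.mean))
        self.eta, self.dt, increased = adapt_learning_rate(
            self.eta, self.dt, error, self.sigma(),
            self.q, self.m, self.alpha)
        if increased:
            # The running average is reset each time eta is increased.
            self.mean, self.count = list(self.theta), 1


if __name__ == "__main__":
    row = AdaptiveRow([0.5, 0.5])
    for _ in range(50):
        row.observe(0)
    print(row.theta, row.eta)
```

Keeping the decision of step 2 in a separate function makes it possible to exercise the increase, decrease and hold branches directly, independently of how the mean and variance are estimated.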

The parameters q, m, and $\alpha$ are adjustable. The 
parameter q determines the confidence in the decision 
to increase $\eta$. The parameter $\alpha$ is a threshold 
reflecting the acceptable convergence of the 
parameters. $E[\theta_{ijk}^{t+1}]$ and $(\sigma_{ij}^{t+1})^2$ are the mean and 
variance of the estimated parameter. The mean may be 
estimated by a running average to be reset each time 
$\eta_{ij}$ is increased. The variance may be estimated 
according to equation 6 with t replaced by $\delta t$ and 
with $c^* = 0.5$ which provides the worst case estimate of 
the variance. 

It may be shown that the rate of decrease of $\eta_{ij}$ 
is proportional to $1/t_n$ where $t_n$ is the number of 
times $Pa_i = pa_i^j$ until the n'th reduction of $\eta$. This 
rate is consistent with an optimal annealing rate 
1/t. In the following analysis it is assumed that $\eta_{ij}$ 
is not increased at any point in time. If $\eta_{ij}$ is 
increased, t is reset to zero and the analysis still 
holds. 

Using the reduction rule outlined above, with 
$\eta_{ij}(0) = \eta_0$ and for $t_n < t < t_{n+1}$, the learning rate $\eta_{ij}(t)$ is 
bounded by the following: 

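$$\frac{\log(\alpha^{-1})}{m-1} \cdot \frac{1}{t_n + K + \log(\alpha^{-1})\, n} \leq \eta_{ij}(t) \leq \frac{\log(\alpha^{-1})}{m-1} \cdot \frac{1}{t_n + K - n}$$

where 

$$K = \frac{\log(\alpha^{-1})}{\eta_0 (m-1)}, \quad 0 < \alpha < 1, \; m > 1, \; n \in \mathbb{N}.$$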



The bounds become tighter as $t_n$ increases. $\eta_{ij}$ 
is reduced at discrete steps which increase in length 
as t increases. Therefore, $\eta_{ij}$ will have longer 
intervals at which it remains constant but at the end 
of the interval it reduces as 1/t. When the error 
between the current estimate and its mean value 
remains small, $\eta_{ij}$ reduces with the optimal schedule. 
At the limit the estimated Bayesian network 
parameters converge to the target parameters with 
zero error. If the error becomes large, $\eta_{ij}$ increases, 
which increases the ability to adapt faster to 
changes in the modeled environment or to break out of a 
local maximum. The time origin is effectively shifted 
forward each time $\eta_{ij}$ is increased, which yields a 
method that is substantially insensitive to an 
absolute time origin. 
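
The 1/t behavior of the reduction rule can be illustrated with a short simulation in which the learning rate is never increased. The function names and the example settings (initial rate 0.5, m = 2, alpha = 0.05) are assumptions chosen for illustration, and the counting convention follows one literal reading of the schedule: an observation either increments the count or triggers a reduction.

```python
import math


def reduction_schedule(eta0, m, alpha, num_obs):
    """Apply only the reduction rule and record when each reduction occurs.

    Returns a list of (t_n, eta) pairs: the observation index of the n-th
    reduction and the learning rate in force after it.
    """
    eta, dt = eta0, 0
    reductions = []
    for t in range(1, num_obs + 1):
        if (1.0 - eta) ** dt < alpha:
            eta /= m          # decrease eta by the factor m
            dt = 0            # restart the count for the new eta
            reductions.append((t, eta))
        else:
            dt += 1
    return reductions


def annealing_constant(m, alpha):
    """The constant log(1/alpha) / (m - 1) that t_n * eta is compared with."""
    return math.log(1.0 / alpha) / (m - 1.0)


if __name__ == "__main__":
    for t_n, eta in reduction_schedule(0.5, 2.0, 0.05, 100):
        print(t_n, eta, t_n * eta, annealing_constant(2.0, 0.05))
```

With these settings each interval between reductions is twice as long as the previous one while the learning rate halves, so the product of the learning rate and the elapsed count stays bounded, which is the sense in which the rate decays as 1/t.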

The foregoing detailed description of the 
present invention is provided for the purposes of 
illustration and is not intended to be exhaustive or 
to limit the invention to the precise embodiment 
disclosed. Accordingly, the scope of the present 
invention is defined by the appended claims. 






Attorney Docket No. 10006656 



